The rise of peer-to-peer platforms, particularly Airbnb, has transformed the landscape of short-term rentals and the overall housing market. Initially designed for casual hosts to share their spare rooms or properties, Airbnb has evolved into a significant player in the hospitality industry, leading to a blend of professional and non-professional hosts. This has sparked considerable debate regarding its implications for local communities (Chen, W., Wei, Z. and Xie, K., 2022).
Research has identified various negative effects associated with the professionalization of Airbnb hosting, such as increased rental prices and, accordingly, a decrease in available, affordable housing (Barron, K., Kung, E. and Proserpio, D., 2021). In response, several cities have implemented regulatory measures aimed at mitigating these impacts, although the effects remain a present issue (Garz, M. and Schneider, A., 2023). In Barcelona, locals have responded to these issues by demonstrating for affordable housing and against the Airbnb tourism the city attracts. Mayor Jaume Collboni recently decided to ban short-term holiday rentals by 2028, which affects over 10,000 registered properties. With rents in Barcelona surging by 70% over the past decade and the broader trend of backlash against mass tourism globally, this decision highlights the complex interplay between economic interest and the need for affordable housing.
In light of these insights, this project aims to explore the complexities of host professionalism on Airbnb, with a particular focus on Barcelona. The primary objective is to analyze and compare the listings of professional versus non-professional Airbnb hosts in Barcelona. By focusing on host professionalism, defined as having five or more listings, we aim to understand how different hosting practices influence guest experiences, pricing strategies and the housing market in general.
Thus, this study aims to answer the following Research Question: To what degree do the variables from Airbnb listings influence the prediction of host professionalization?
Our study seeks to provide a nuanced understanding of the dynamics at play in Barcelona’s Airbnb rental landscape. We aim to identify key differences in hosting styles and assess potential implications of these differences. This includes how professional hosts might contribute to housing pressures compared to their non-professional counterparts. Ultimately, our research aims to inform regulatory frameworks that promote market transparency and support a balanced peer-to-peer platform economy.
The chosen dataset is from Inside Airbnb, a mission-driven project
that provides data about Airbnb’s impact on residential communities. It
includes 16920 observations of 75 variables concerning Airbnb listings
in Barcelona. Key variables include
host_listings_count, price,
neighbourhood_cleansed, host_is_superhost and
number_of_reviews. It was retrieved from Kaggle (Jiang
2023).
The analysis requires the following R packages:
liver: used for the MSE
naivebayes: used to implement the naive Bayes classification algorithm
ggplot2: used to visualize data in R
pROC: used to visualize and analyse ROC curves
psych: used for descriptive statistics, reliability analysis and factor analysis
lubridate: used for date-time data
dplyr: used for data transformation and manipulation
Hmisc: used for data analysis (descriptive statistics, plotting, data imputation)
ggcorrplot: used to visualize the correlation matrix
naniar: used for missing data
We load them as follows:
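A minimal sketch of this loading step, assuming the ten packages listed above are already installed from CRAN. The names `pkgs` and `installed` are illustrative and not part of the original analysis.

```r
# Packages named in the list above; `pkgs` is an illustrative name.
pkgs <- c("liver", "naivebayes", "ggplot2", "pROC", "psych",
          "lubridate", "dplyr", "Hmisc", "ggcorrplot", "naniar")

# Check which packages are available, attach those, and report the rest
# instead of stopping at the first missing one.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```

Several of these packages export functions with the same name, so the loading order decides which version is used when a function is called without a `package::` prefix.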
Peer-to-peer platforms such as Airbnb have revolutionized the way individuals engage in short-term rentals and accommodations. These platforms enable property owners to rent out their spaces directly to guests. In Barcelona, Airbnb has become a dominant force in the short-term rental market, contributing to significant changes in local housing dynamics (Barron et al., 2020). The increasing professionalization of Airbnb hosts, particularly those managing multiple listings, raises important questions regarding the impact of these practices on local communities, housing affordability and regulatory measures. Research indicates that the rise of peer-to-peer rentals can lead to increased housing costs and reduced availability of affordable housing options (Barron, Kung, & Proserpio, 2021). Consequently, understanding the dynamics of Airbnb listings and the differences between professional and non-professional hosts is essential.
The classification of Airbnb hosts into professional and non-professional categories allows us to investigate the broader impacts of the platform on housing dynamics. Professional hosts, defined as those managing five or more listings, might operate in a manner that prioritizes profit maximization and guest turnover (Abrate et al., 2022). In contrast, non-professional hosts might rent out their primary residences or spare rooms, often for supplemental income. This differentiation in hosting practices is expected to result in significant variations in pricing strategies, guest experiences, and overall market behavior (Miguel et al., 2024). Additionally, the experience offered by professional hosts, characterized by more amenities and higher service standards, may attract a different demographic of guests, further influencing market dynamics. Thus, the following hypotheses are proposed:
Ha: Professional hosts in Barcelona charge higher rental prices compared to non-professional hosts.
Hb: The guest experience ratings differ significantly between professional and non-professional hosts, with professional hosts receiving higher ratings.
The number of bedrooms offered in an Airbnb listing could be influenced by the host's professional status. Larger properties with more bedrooms are likely to attract higher nightly rates and cater to groups or families, which can generate more income than single-bedroom units. Additionally, professional hosts may have the financial capacity to invest in larger properties. In contrast, non-professional hosts might offer their primary or secondary homes. These listings may be more personal and reflect smaller living spaces with fewer bedrooms. Thus, the following hypothesis is suggested:
Hc: Professional hosts are more likely to list properties with a higher number of bedrooms compared to non-professional hosts.
The rise of Airbnb and the professionalization of the platform have motivated investment in real estate for short-term accommodation, especially near tourist areas and city centers. This has led to issues of unaffordable housing, gentrification, and overtourism in cities like Barcelona. Using Exploratory Spatial Data Analysis techniques, Deboosere et al. (2019) show that “listings that belong to professional hosts are more concentrated in city centers [and near accessible transit], and influenced more by the location of tourist attractions and hotels, than of non-professional [hosts].” As a result, the following hypothesis is suggested:
Hd: Professional hosts have listings more densely located in city centers and near tourist attractions than non-professional hosts.
For Airbnb hosts to be considered Superhosts, several criteria are taken into account, including “maintaining a 90% or higher response rate” and “maintaining a 4.8 or higher overall rating” (What’s Required to Be a Superhost - Airbnb Help Centre, n.d.). These reputation systems hold high economic value, as they are seen as symbols of trust and “are crucial prerequisites for peer-to-peer rental and sharing” (Dann, n.d.). Professional Airbnb hosts approach hosting as a business and run it accordingly, ensuring timely communication (Abrate et al., 2021) and offering a more standardized and polished service to guests; “professional hosts invest in their properties, offering high-quality photos, detailed descriptions, and desirable amenities like Wi-Fi, parking, professional cleaning service and more” (Chang & Li, 2020). These factors are expected to contribute to a higher level of guest satisfaction and ratings. To investigate this, we propose the following hypotheses:
He: Professional hosts tend to receive higher guest ratings compared to non-professional hosts due to their structured and business-oriented approach to property management.
Hf: Professional hosts have higher response rates than non-professional hosts.
The raw dataset must be refined to ensure its suitability for algorithmic processing. The str() function gives an initial understanding of the dataset’s variables and their types, allowing us to identify the necessary cleaning steps.
'data.frame': 16920 obs. of 75 variables:
$ id : num 6.73e+17 4.42e+07 1.70e+07 1.87e+04 5.54e+17 ...
$ listing_url : chr "https://www.airbnb.com/rooms/673276379194656210" "https://www.airbnb.com/rooms/44192271" "https://www.airbnb.com/rooms/17039441" "https://www.airbnb.com/rooms/18674" ...
$ scrape_id : num 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
$ last_scraped : chr "2022-09-10" "2022-09-10" "2022-09-10" "2022-09-11" ...
$ source : chr "city scrape" "city scrape" "city scrape" "city scrape" ...
$ name : chr "Habitación muy acogedora." "Cozy terrace apartment Apartamento con patio" "Apart. full equipped. 2 min to Subway lines L1, L9" "Huge flat for 8 people close to Sagrada Familia" ...
$ description : chr "Abrace la simplicidad en este lugar tranquilo y bien ubicado<br /><br /><b>The space</b><br />Estilo Zen. Tranq"| __truncated__ "A private terraced + 2 bedroom ground floor apartment with private entrance and furbished kitchen, with table a"| __truncated__ "Precioso apartamento ideal para parejas. Luminoso y práctico.<br />El apartamento está cuidado al detalle con e"| __truncated__ "110m2 apartment to rent in Barcelona. Located in the Eixample district, near the Sagrada Familia. It has a smal"| __truncated__ ...
$ neighborhood_overview : chr "El barrio es tranquilo y bien hubicado. Cerca del piso, hay farmácia, panaderías, supermercados y mercaditos."| __truncated__ "The neighbourhood is quiet with trees. Though it is residential, resturants, supermarkets and fruit shops are a"| __truncated__ "La zona dispone de servicios básicos y una excelente conexión con las principales líneas de metro, L1 y L9 sur."| __truncated__ "Apartment in Barcelona located in the heart of Eixample district, within only 150 m form the great Sagrada Fami"| __truncated__ ...
$ picture_url : chr "https://a0.muscache.com/pictures/miso/Hosting-673276379194656210/original/62f451b6-4200-4b40-8c9f-416b15669e1e.jpeg" "https://a0.muscache.com/pictures/2e579e6b-b717-444e-90b7-b8e0cf856440.jpg" "https://a0.muscache.com/pictures/02af8b09-c8ca-4ed7-86da-8b546b4bc030.jpg" "https://a0.muscache.com/pictures/13031453/413cdbfc_original.jpg" ...
$ host_id : int 51421682 200754964 114340651 71615 442972056 115783949 90417 135703 129000409 15171574 ...
$ host_url : chr "https://www.airbnb.com/users/show/51421682" "https://www.airbnb.com/users/show/200754964" "https://www.airbnb.com/users/show/114340651" "https://www.airbnb.com/users/show/71615" ...
$ host_name : chr "Maria Das Merces" "Nuria" "Pepa" "Mireia And Maria" ...
$ host_since : chr "2015-12-15" "2018-07-08" "2017-02-01" "2010-01-19" ...
$ host_location : chr "" "Barcelona, Spain" "" "Barcelona, Spain" ...
$ host_about : chr "Sou Bailarina y Terapeuta Integrativa. Trabalho com Dança Terapia e elementos do Yoga, Tai-Chi-Chuan e Medicina"| __truncated__ "I live in Barcelona. I love travelling and meeting people. I like hiking, I enjoy nature and also city life. " "" "We are Mireia (43) & Maria (45), two multilingual entrepreneurs loving Barcelona and having big experience in t"| __truncated__ ...
$ host_response_time : chr "within an hour" "within an hour" "within a few hours" "within an hour" ...
$ host_response_rate : chr "100%" "100%" "100%" "98%" ...
$ host_acceptance_rate : chr "100%" "100%" "97%" "93%" ...
$ host_is_superhost : chr "f" "t" "t" "f" ...
$ host_thumbnail_url : chr "https://a0.muscache.com/im/pictures/user/709d3dcd-4a41-4feb-bc30-e570472c183b.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/user/0e6bed83-48c6-444c-8e26-36abaea21ad6.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/user/8a1dc3f8-b149-421e-a8b8-ecd4aa4d47ad.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/users/71615/profile_pic/1426612511/original.jpg?aki_policy=profile_small" ...
$ host_picture_url : chr "https://a0.muscache.com/im/pictures/user/709d3dcd-4a41-4feb-bc30-e570472c183b.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/user/0e6bed83-48c6-444c-8e26-36abaea21ad6.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/user/8a1dc3f8-b149-421e-a8b8-ecd4aa4d47ad.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/users/71615/profile_pic/1426612511/original.jpg?aki_policy=profile_x_medium" ...
$ host_neighbourhood : chr "" "" "" "la Sagrada Família" ...
$ host_listings_count : int 1 1 1 40 8 33 5 3 308 12 ...
$ host_total_listings_count : int 1 1 2 42 8 54 9 15 364 13 ...
$ host_verifications : chr "['email', 'phone']" "['email', 'phone']" "['email', 'phone']" "['email', 'phone']" ...
$ host_has_profile_pic : chr "t" "t" "t" "t" ...
$ host_identity_verified : chr "t" "t" "t" "t" ...
$ neighbourhood : chr "L'Hospitalet de Llobregat, Catalunya, Spain" "L'Hospitalet de Llobregat, Catalunya, Spain" "L'Hospitalet de Llobregat, Catalunya, Spain" "Barcelona, CT, Spain" ...
$ neighbourhood_cleansed : chr "la Bordeta" "la Maternitat i Sant Ramon" "Sants - Badal" "la Sagrada Família" ...
$ neighbourhood_group_cleansed : chr "Sants-Montjuïc" "Les Corts" "Sants-Montjuïc" "Eixample" ...
$ latitude : num 41.4 41.4 41.4 41.4 41.4 ...
$ longitude : num 2.13 2.11 2.12 2.17 2.12 ...
$ property_type : chr "Private room in condo" "Entire condo" "Entire rental unit" "Entire rental unit" ...
$ room_type : chr "Private room" "Entire home/apt" "Entire home/apt" "Entire home/apt" ...
$ accommodates : int 2 5 2 8 2 8 5 6 4 6 ...
$ bathrooms : logi NA NA NA NA NA NA ...
$ bathrooms_text : chr "1 shared bath" "1 bath" "1 bath" "2 baths" ...
$ bedrooms : int 2 2 1 3 1 4 3 2 1 2 ...
$ beds : int 2 4 1 6 1 7 4 3 1 2 ...
$ amenities : chr "[\"Ethernet connection\", \"Hangers\", \"Hot water kettle\", \"Microwave\", \"Toaster\", \"Extra pillows and bl"| __truncated__ "[\"Fire extinguisher\", \"Stove\", \"Air conditioning\", \"Cooking basics\", \"Private patio or balcony\", \"Ha"| __truncated__ "[\"Stove\", \"Cooking basics\", \"Security cameras on property\", \"Hangers\", \"TV\", \"Microwave\", \"Iron\","| __truncated__ "[\"Kitchen\", \"Hot water\", \"Host greets you\", \"Long term stays allowed\", \"Wifi\", \"Shampoo\", \"Heating"| __truncated__ ...
$ price : chr "$59.00" "$110.00" "$86.00" "$180.00" ...
$ minimum_nights : int 1 3 3 1 2 31 5 2 1 1 ...
$ maximum_nights : int 1125 30 10 1125 365 1125 300 31 1125 1120 ...
$ minimum_minimum_nights : int 1 3 3 1 2 31 4 2 2 1 ...
$ maximum_minimum_nights : int 1 3 3 3 2 31 7 2 6 3 ...
$ minimum_maximum_nights : int 1125 1125 10 1125 365 1125 1125 31 3 1120 ...
$ maximum_maximum_nights : int 1125 1125 10 1125 365 1125 1125 31 1125 1120 ...
$ minimum_nights_avg_ntm : num 1 3 3 1.6 2 31 5.5 2 4.9 1.5 ...
$ maximum_nights_avg_ntm : num 1125 1125 10 1125 365 ...
$ calendar_updated : logi NA NA NA NA NA NA ...
$ has_availability : chr "t" "t" "t" "t" ...
$ availability_30 : int 18 5 2 10 8 28 12 3 8 4 ...
$ availability_60 : int 48 25 2 29 22 58 28 4 27 13 ...
$ availability_90 : int 78 55 19 39 52 88 55 24 57 42 ...
$ availability_365 : int 351 151 218 60 106 269 84 287 332 65 ...
$ calendar_last_scraped : chr "2022-09-10" "2022-09-10" "2022-09-10" "2022-09-11" ...
$ number_of_reviews : int 9 54 145 30 10 0 62 74 59 48 ...
$ number_of_reviews_ltm : int 9 40 34 9 10 0 10 11 16 16 ...
$ number_of_reviews_l30d : int 9 4 3 3 0 0 0 0 0 2 ...
$ first_review : chr "2022-08-11" "2020-11-20" "2017-03-01" "2013-05-27" ...
$ last_review : chr "2022-09-08" "2022-08-26" "2022-09-06" "2022-08-29" ...
$ review_scores_rating : num 4.89 4.83 4.94 4.38 4.7 NA 4.73 4.34 4.07 4.52 ...
$ review_scores_accuracy : num 4.89 4.89 4.97 4.48 5 NA 4.92 4.34 4.47 4.73 ...
$ review_scores_cleanliness : num 5 4.7 4.94 4.72 4.9 NA 4.88 4.42 4.44 4.75 ...
$ review_scores_checkin : num 5 5 4.99 4.83 4.7 NA 4.93 4.84 4.53 4.71 ...
$ review_scores_communication : num 4.89 4.98 4.99 4.79 4.5 NA 4.98 4.82 4.47 4.75 ...
$ review_scores_location : num 4.89 4.52 4.7 4.79 4.4 NA 4.58 4.84 4.27 4.23 ...
$ review_scores_value : num 4.78 4.65 4.89 4.34 4.8 NA 4.6 4.45 4.22 4.52 ...
$ license : chr "Exempt" "HUTB-013294" "" "HUTB-002062" ...
$ instant_bookable : chr "t" "f" "f" "t" ...
$ calculated_host_listings_count : int 1 1 1 38 8 31 2 3 101 4 ...
$ calculated_host_listings_count_entire_homes : int 0 1 1 38 8 30 2 3 97 4 ...
$ calculated_host_listings_count_private_rooms: int 1 0 0 0 0 1 0 0 4 0 ...
$ calculated_host_listings_count_shared_rooms : int 0 0 0 0 0 0 0 0 0 0 ...
$ reviews_per_month : num 9 2.45 2.15 0.27 1.52 NA 0.44 0.54 1.35 0.75 ...
Several variables in the dataset are irrelevant to the analysis as they do not contribute meaningful information. These include variables like URLs, IDs and names. Further, a variable that was entirely empty was removed.
The following variables are removed because
they are identifiers: listing_url,
last_scraped, name, description,
neighborhood_overview, picture_url,
host_id, host_url, host_name,
host_location, host_about,
host_thumbnail_url, host_picture_url,
host_neighbourhood, host_listings_count,
host_has_profile_pic,
host_identity_verified
they provide scraping details: calendar_updated,
calendar_last_scraped, source,
scrape_id
they are redundant in different aspects:
- review_scores_accuracy,
review_scores_cleanliness,
review_scores_checkin,
review_scores_communication,
review_scores_location,
review_scores_value,
number_of_reviews_ltm,
number_of_reviews_l30d;
- calculated_host_listings_count_entire_homes,
calculated_host_listings_count_private_rooms,
calculated_host_listings_count_shared_rooms,
host_total_listings_count;
- has_availability, availability_30,
availability_60, availability_90;
- minimum_minimum_nights,
maximum_minimum_nights,
minimum_maximum_nights,
maximum_maximum_nights,
minimum_nights_avg_ntm,
maximum_nights_avg_ntm
they are descriptive text and therefore cannot be analyzed:
neighbourhood, property_type,
amenities, license,
bathrooms_text
it consists of only missing data: bathrooms
Several variables are categorical or binary, so they need to be
formatted for the later analysis. The variable price is
cleaned by removing dollar signs and commas, while the variables
neighbourhood_cleansed, neighbourhood_group_cleansed, room_type,
host_response_time and host_is_superhost are converted to factors and
instant_bookable is converted to a logical. The percentage variables
host_response_rate and host_acceptance_rate are converted to numbers,
and host_since, first_review and last_review are converted to dates.
For the host_verifications variable, a function counts
the number of verifications for each host, producing the numerical
variable verification_count that can be analyzed statistically.
The target variable business is created from
calculated_host_listings_count to indicate hosts which
manage more than 5 listings.
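The transformations described above can be checked on a few toy values before they are applied to the full dataset. The following is a minimal sketch in base R. The input vectors are made up for illustration, and the threshold of more than 5 listings follows the rule stated above.

```r
# Toy values mimicking the raw columns (made up for illustration).
raw_price  <- c("$59.00", "$1,250.00", "$110.00")
raw_verif  <- c("['email', 'phone']", "['email']", "[]")
n_listings <- c(1, 5, 38)

# Strip the dollar sign and thousands separator, then convert to numeric.
price <- as.numeric(gsub("[$,]", "", raw_price))

# Drop the square brackets, split on commas and count the pieces;
# an empty list "[]" yields a count of 0.
verification_count <- lengths(strsplit(gsub("\\[|\\]", "", raw_verif), ","))

# Flag hosts that manage more than 5 listings.
business <- n_listings > 5

print(price)
print(verification_count)
print(business)
```

Note that a host with exactly 5 listings is not flagged under this rule.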
airbnb = airbnb %>% select(-listing_url, -scrape_id, -last_scraped, -source, -name, -description, -neighborhood_overview, -picture_url, -host_id, -host_url, -host_name, -host_location, -host_about, -host_thumbnail_url, -host_picture_url, -host_neighbourhood, -host_listings_count, -host_total_listings_count, -host_has_profile_pic, -host_identity_verified, -neighbourhood, -property_type, -amenities, -minimum_minimum_nights, -maximum_minimum_nights, -minimum_maximum_nights, -maximum_maximum_nights, -minimum_nights_avg_ntm, -maximum_nights_avg_ntm, -calendar_updated, -calendar_last_scraped, -review_scores_accuracy, -review_scores_cleanliness, -review_scores_checkin, -review_scores_communication, -number_of_reviews_ltm, -review_scores_location, -review_scores_value, -license, -calculated_host_listings_count_entire_homes, -calculated_host_listings_count_private_rooms, -calculated_host_listings_count_shared_rooms, -bathrooms_text, -bathrooms, -has_availability, -availability_30, -availability_60, -availability_90, -number_of_reviews_l30d)
airbnb$price <- gsub("\\$", "", airbnb$price)
airbnb$price <- as.numeric(gsub(",", "", airbnb$price))
airbnb$neighbourhood_cleansed = as.factor(airbnb$neighbourhood_cleansed)
airbnb$neighbourhood_group_cleansed = as.factor(airbnb$neighbourhood_group_cleansed)
airbnb$room_type = as.factor(airbnb$room_type)
airbnb$host_response_time[airbnb$host_response_time == "N/A" | airbnb$host_response_time == ""] = NA
airbnb$host_response_time = as.factor(airbnb$host_response_time)
airbnb$host_is_superhost = ifelse(airbnb$host_is_superhost == "t", TRUE, ifelse(airbnb$host_is_superhost == "f", FALSE, NA))
airbnb$host_is_superhost = as.factor(airbnb$host_is_superhost)
airbnb$instant_bookable = ifelse(airbnb$instant_bookable == "t", TRUE, ifelse(airbnb$instant_bookable == "f", FALSE, NA))
count_elements <- function(x) {
  # Remove the square brackets and split by commas
  elements <- strsplit(gsub("\\[|\\]", "", x), ",")
  # Count the number of elements after splitting
  sapply(elements, length)
}
airbnb$verification_count = count_elements(airbnb$host_verifications)
airbnb$first_review <- as.Date(airbnb$first_review)
airbnb$host_since <- as.Date(airbnb$host_since)
airbnb$last_review <- as.Date(airbnb$last_review)
airbnb$host_response_rate <- as.numeric(gsub("%", "", airbnb$host_response_rate))
airbnb$host_acceptance_rate <- as.numeric(gsub("%", "", airbnb$host_acceptance_rate))
airbnb$business <- ifelse(airbnb$calculated_host_listings_count > 5, TRUE, FALSE)
We use the summary() function to identify variables that may
contain outliers.
id host_since host_response_time
Min. :1.867e+04 Min. :2008-09-19 a few days or more: 393
1st Qu.:1.842e+07 1st Qu.:2013-10-28 within a day : 1225
Median :3.591e+07 Median :2016-06-19 within a few hours: 2426
Mean :1.147e+17 Mean :2016-07-16 within an hour :10034
3rd Qu.:5.127e+07 3rd Qu.:2019-02-26 NA's : 2842
Max. :7.128e+17 Max. :2022-09-07
NA's :2
host_response_rate host_acceptance_rate host_is_superhost host_verifications
Min. : 0.00 Min. : 0.00 FALSE:14114 Length:16920
1st Qu.: 95.00 1st Qu.: 89.00 TRUE : 2804 Class :character
Median :100.00 Median : 98.00 NA's : 2 Mode :character
Mean : 93.81 Mean : 88.17
3rd Qu.:100.00 3rd Qu.:100.00
Max. :100.00 Max. :100.00
NA's :2842 NA's :2547
neighbourhood_cleansed neighbourhood_group_cleansed
la Dreta de l'Eixample :2030 Eixample :5692
el Raval :1216 Ciutat Vella :3554
el Barri Gòtic :1048 Sants-Montjuïc:2146
la Sagrada Família : 961 Sant Martí :1640
la Vila de Gràcia : 952 Gràcia :1420
Sant Pere, Santa Caterina i la Ribera: 933 Les Corts : 755
(Other) :9780 (Other) :1713
latitude longitude room_type accommodates
Min. :41.32 Min. :2.045 Entire home/apt:10046 Min. : 0.000
1st Qu.:41.38 1st Qu.:2.155 Hotel room : 172 1st Qu.: 2.000
Median :41.39 Median :2.167 Private room : 6526 Median : 3.000
Mean :41.39 Mean :2.165 Shared room : 176 Mean : 3.487
3rd Qu.:41.40 3rd Qu.:2.177 3rd Qu.: 5.000
Max. :41.48 Max. :2.232 Max. :16.000
bedrooms beds price minimum_nights
Min. : 1.000 Min. : 1.000 Min. : 0.0 Min. : 1.00
1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 50.0 1st Qu.: 1.00
Median : 1.000 Median : 2.000 Median : 100.0 Median : 3.00
Mean : 1.742 Mean : 2.443 Mean : 172.9 Mean : 13.27
3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 191.0 3rd Qu.: 31.00
Max. :20.000 Max. :40.000 Max. :90000.0 Max. :1124.00
NA's :571 NA's :299
maximum_nights availability_365 number_of_reviews first_review
Min. : 1.0 Min. : 0.0 Min. : 0.00 Min. :2010-10-03
1st Qu.: 180.0 1st Qu.: 39.0 1st Qu.: 1.00 1st Qu.:2017-04-01
Median : 365.0 Median :164.0 Median : 7.00 Median :2019-06-15
Mean : 651.4 Mean :170.8 Mean : 41.03 Mean :2019-03-06
3rd Qu.:1125.0 3rd Qu.:308.0 3rd Qu.: 44.00 3rd Qu.:2021-10-22
Max. :3000.0 Max. :365.0 Max. :1311.00 Max. :2022-09-10
NA's :3614
last_review review_scores_rating instant_bookable
Min. :2011-06-23 Min. :0.000 Mode :logical
1st Qu.:2022-03-20 1st Qu.:4.400 FALSE:8161
Median :2022-08-13 Median :4.670 TRUE :8759
Mean :2021-11-23 Mean :4.526
3rd Qu.:2022-08-28 3rd Qu.:4.890
Max. :2022-09-10 Max. :5.000
NA's :3614 NA's :3614
calculated_host_listings_count reviews_per_month verification_count
Min. : 1.00 Min. : 0.010 Min. :0.000
1st Qu.: 1.00 1st Qu.: 0.250 1st Qu.:2.000
Median : 4.00 Median : 0.890 Median :2.000
Mean : 19.51 Mean : 1.416 Mean :2.071
3rd Qu.: 20.00 3rd Qu.: 2.030 3rd Qu.:2.000
Max. :161.00 Max. :56.130 Max. :3.000
NA's :3614
business
Mode :logical
FALSE:9458
TRUE :7462
Outliers in the price, bedrooms and
beds variables are identified from histograms and replaced with NA, as
are implausible zero values in accommodates and review_scores_rating.
This prevents extreme values from skewing the analysis results and ensures the data
reflects typical values more accurately.
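As an illustration of this replacement rule, the sketch below applies it to a toy price vector. The helper `to_na_outside` and its argument names are hypothetical. The bounds (zero and 3500) are the ones used for price in this report.

```r
# Hypothetical helper: set values at or below `lower_excl`, or above `upper`, to NA.
to_na_outside <- function(x, lower_excl = -Inf, upper = Inf) {
  out_of_range <- !is.na(x) & (x <= lower_excl | x > upper)
  x[out_of_range] <- NA
  x
}

# Toy prices: a zero price, two typical prices, an extreme price and a missing value.
price <- c(0, 59, 110, 90000, NA)
cleaned <- to_na_outside(price, lower_excl = 0, upper = 3500)
print(cleaned)
```

Values that were already missing stay missing, so the step can be applied before any imputation.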
ggplot(data = airbnb, aes(x = price)) +
  geom_histogram(bins = 30, color = "blue", fill = "lightblue") + coord_cartesian(ylim = c(0, 100))
ggplot(data = airbnb, aes(x = price)) + geom_histogram(bins = 150, color = "blue", fill = "lightblue") + coord_cartesian(xlim = c(0, 5000))
ggplot(data = airbnb, aes(x = price)) + geom_histogram(bins = 150, color = "blue", fill = "lightblue") + coord_cartesian(xlim = c(0, 10000), ylim = c(0, 1000))
# Treat zero prices and prices above 3500 as outliers
airbnb = mutate(airbnb, price = ifelse(price == 0 | price > 3500, NA, price))
ggplot(data = airbnb, aes(x = bedrooms)) + geom_histogram(bins = 30, color = "blue", fill = "lightblue") + coord_cartesian(ylim = c(0, 5))
ggplot(data = airbnb, aes(x = beds)) + geom_histogram(bins = 30, color = "blue", fill = "lightblue") + coord_cartesian(ylim = c(0, 5))
airbnb = mutate(airbnb, beds = ifelse(beds > 22, NA, beds))
# Clean bedrooms in place (writing the result to beds would overwrite that variable)
airbnb = mutate(airbnb, bedrooms = ifelse(bedrooms > 15, NA, bedrooms))
airbnb = mutate(airbnb, accommodates = ifelse(accommodates == 0, NA, accommodates))
airbnb = mutate(airbnb, review_scores_rating = ifelse(review_scores_rating == 0, NA, review_scores_rating))
The gg_miss_var() function from the naniar package is
used to see which variables contain NA values. We consequently proceed
to impute these variables.
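To make the imputation step concrete, the following base-R sketch fills missing values by sampling from the observed values of the same variable. We assume this mirrors what the 'random' option of impute() does. `impute_random` is a hypothetical helper written for illustration, not the package function.

```r
# Hypothetical helper: replace each NA with a value drawn (with replacement)
# from the non-missing values of the same vector.
impute_random <- function(x) {
  missing  <- is.na(x)
  observed <- x[!missing]
  if (any(missing) && length(observed) > 0) {
    draws <- sample.int(length(observed), sum(missing), replace = TRUE)
    x[missing] <- observed[draws]
  }
  x
}

set.seed(1)  # only to make the illustration reproducible
rating <- c(4.89, NA, 4.94, 4.38, NA)
filled <- impute_random(rating)
print(filled)
```

Random imputation preserves the marginal distribution of a variable but ignores its relationship with other variables, which is worth keeping in mind for variables with many missing values.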
Additionally, to focus on relevant and active listings, properties
that are no longer active are removed from the dataset. A listing is
removed only if it has no price available and either its last
review dates from before 2022 or it has no availability in the coming year. We
also exclude listings with the variable minimum_nights
higher than three months (92 nights): a prior analysis of the dataset and the linked
online listings revealed that such high values are set so that users cannot book
the listing, without the advertiser having to remove it from the
platform. This step ensures that the analysis is as relevant as
possible.
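The removal rule can be illustrated on a toy data frame. This is a minimal sketch under the rule as stated above. The data are made up, and a missing review year is treated here as "not before 2022", which is an assumption of the sketch.

```r
# Toy listings (values made up for illustration).
listings <- data.frame(
  id               = 1:5,
  price            = c(59, NA, NA, 110, NA),
  year_last_review = c(2022, 2019, NA, 2021, 2022),
  availability_365 = c(351, 120, 0, 0, 40),
  minimum_nights   = c(1, 3, 2, 200, 5)
)

no_price        <- is.na(listings$price)
# FALSE when the year is missing, so no NA enters the row index.
old_review      <- !is.na(listings$year_last_review) & listings$year_last_review != 2022
no_availability <- listings$availability_365 == 0
blocked         <- listings$minimum_nights > 92

# Inactive: no price AND (old last review OR never available), or blocked by minimum_nights.
inactive <- (no_price & (old_review | no_availability)) | blocked
active   <- listings[!inactive, ]
print(active$id)
```

Building the logical index so that it never contains NA avoids the all-NA rows that R returns when a data frame is subset with a missing index value.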
# Imputing
# Derive the year of the last review first, then impute its missing values
airbnb$year_last_review <- year(airbnb$last_review)
airbnb$year_last_review = impute(airbnb$year_last_review, 'random')
# Removing inactive listings
airbnb = airbnb[!(airbnb$year_last_review != 2022 & is.na(airbnb$price)), ]
airbnb = airbnb[!(airbnb$availability_365 == 0 & is.na(airbnb$price)), ]
airbnb = airbnb[!(airbnb$minimum_nights > 92), ]
airbnb = airbnb[!is.na(airbnb$id), ]
airbnb$accommodates = impute(airbnb$accommodates, 'random')
airbnb$reviews_per_month = impute(airbnb$reviews_per_month, 'random')
airbnb$review_scores_rating = impute(airbnb$review_scores_rating, 'random')
airbnb$last_review = impute(airbnb$last_review, 'random')
airbnb$year_last_review = year(airbnb$last_review)
airbnb$first_review = impute(airbnb$first_review, 'random')
airbnb$host_response_time = impute(airbnb$host_response_time, 'random')
airbnb$host_response_rate = impute(airbnb$host_response_rate, 'random')
airbnb$host_acceptance_rate = impute(airbnb$host_acceptance_rate, 'random')
airbnb$host_is_superhost = impute(airbnb$host_is_superhost, 'random')
airbnb$bedrooms = impute(airbnb$bedrooms, 'random')
airbnb$beds = impute(airbnb$beds, 'random')
airbnb$host_since = impute(airbnb$host_since, 'random')
airbnb$price = impute(airbnb$price, 'random')
airbnb$year_host_since = (year(airbnb$host_since))
airbnb$year_first_review = (year(airbnb$first_review))
Finally, we remove the variables host_verifications,
first_review, last_review,
calculated_host_listings_count,
year_last_review and host_since, which were used to calculate
other variables, and id, as it is an identifier.
airbnb <- airbnb %>% select(-host_verifications, -id, -first_review, -last_review, -calculated_host_listings_count, -year_last_review, -host_since)
'data.frame': 16771 obs. of 24 variables:
$ host_response_time : Factor w/ 4 levels "a few days or more",..: 4 4 3 4 4 3 4 4 4 4 ...
..- attr(*, "imputed")= int [1:2775] 17 73 116 119 134 151 172 177 193 198 ...
$ host_response_rate : 'impute' num 100 100 100 98 100 96 100 100 100 96 ...
..- attr(*, "imputed")= int [1:2775] 17 73 116 119 134 151 172 177 193 198 ...
$ host_acceptance_rate : 'impute' num 100 100 97 93 100 84 100 100 99 97 ...
..- attr(*, "imputed")= int [1:2483] 17 73 116 119 151 172 177 198 217 257 ...
$ host_is_superhost : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 1 2 1 2 1 1 1 ...
..- attr(*, "imputed")= int 5635
$ neighbourhood_cleansed : Factor w/ 73 levels "Baró de Viver",..: 29 37 66 40 66 35 9 11 29 66 ...
$ neighbourhood_group_cleansed: Factor w/ 10 levels "Ciutat Vella",..: 9 5 9 2 9 9 8 3 9 9 ...
$ latitude : num 41.4 41.4 41.4 41.4 41.4 ...
$ longitude : num 2.13 2.11 2.12 2.17 2.12 ...
$ room_type : Factor w/ 4 levels "Entire home/apt",..: 3 1 1 1 1 1 1 1 1 1 ...
$ accommodates : int 2 5 2 8 2 8 5 6 4 6 ...
$ bedrooms : 'impute' int 2 2 1 3 1 4 3 2 1 2 ...
..- attr(*, "imputed")= int [1:565] 68 110 283 493 757 766 767 775 777 794 ...
$ beds : 'impute' int 2 2 1 3 1 4 3 2 1 2 ...
..- attr(*, "imputed")= int [1:567] 68 110 283 493 757 766 767 775 777 794 ...
$ price : 'impute' num 59 110 86 180 110 71 230 140 305 123 ...
..- attr(*, "imputed")= int [1:2] 7002 7138
$ minimum_nights : int 1 3 3 1 2 31 5 2 1 1 ...
$ maximum_nights : int 1125 30 10 1125 365 1125 300 31 1125 1120 ...
$ availability_365 : int 351 151 218 60 106 269 84 287 332 65 ...
$ number_of_reviews : int 9 54 145 30 10 0 62 74 59 48 ...
$ review_scores_rating : 'impute' num 4.89 4.83 4.94 4.38 4.7 5 4.73 4.34 4.07 4.52 ...
..- attr(*, "imputed")= int [1:3635] 6 17 21 30 32 74 110 158 172 361 ...
$ instant_bookable : logi TRUE FALSE FALSE TRUE TRUE FALSE ...
$ reviews_per_month : 'impute' num 9 2.45 2.15 0.27 1.52 0.03 0.44 0.54 1.35 0.75 ...
..- attr(*, "imputed")= int [1:3543] 6 17 21 30 32 74 158 172 361 367 ...
$ verification_count : int 2 2 2 2 1 2 2 3 2 2 ...
$ business : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
$ year_host_since : num 2015 2018 2017 2010 2022 ...
$ year_first_review : num 2022 2020 2017 2013 2022 ...
In the cleaned dataset we have:
Binary variables: host_is_superhost,
instant_bookable, business
Nominal variables: neighbourhood_cleansed,
neighbourhood_group_cleansed, room_type
Ordinal variables: year_host_since,
year_first_review, host_response_time
Numerical variables: host_response_rate,
host_acceptance_rate, latitude,
longitude, accommodates,
bedrooms, beds, price, minimum_nights,
maximum_nights, availability_365, number_of_reviews,
review_scores_rating, reviews_per_month,
verification_count.
The target variable is business and here we report its
summary.
Mode FALSE TRUE
logical 9366 7405
Moreover, we create a bar plot to visualize the distributions.
ggplot(data = airbnb) +
geom_bar(aes(x = business), fill = c("#df546b", "#2297e6")) +
labs(title = "Bar plot for the target variable 'business'")
FALSE TRUE
0.558464 0.441536
Professional hosts manage 44% of the listings in the dataset.
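The proportions above follow directly from the class counts reported in the summary. A minimal sketch of the calculation is given below. On the full data the same result would presumably come from applying prop.table() to a table of the business variable.

```r
# Class counts of `business` as reported in the summary above.
counts <- c("FALSE" = 9366, "TRUE" = 7405)

# Share of each class; equivalent to prop.table(counts).
shares <- counts / sum(counts)
print(round(shares, 3))
```

With roughly 56% non-professional and 44% professional listings, the two classes are reasonably balanced.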
We report bar plots for the binary variables
host_is_superhost and instant_bookable:
host_is_superhost
ggplot(data = airbnb) +
  geom_bar(aes(x = host_is_superhost, fill = business)) +
  scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = airbnb) +
  geom_bar(aes(x = host_is_superhost, fill = business), position = "fill") +
  scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar plots indicate that the share of professional hosts differs clearly between superhosts and non-superhosts, so host_is_superhost appears relevant for predicting the target variable.
instant_bookable
ggplot(data = airbnb) +
geom_bar(aes(x = instant_bookable, fill = business)) +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = airbnb) +
geom_bar(aes(x = instant_bookable, fill = business), position = "fill") +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Similarly, the plots suggest that the variable instant_bookable is relevant for predicting business, since it is more common for professional hosts to allow users to book the accommodation instantly.
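The bar plots with position = "fill" display, for each level of a predictor, the share of business and non-business listings. As a minimal sketch, the same shares can be computed numerically with a row-normalised contingency table; the toy data frame below is hypothetical and only mimics the structure of airbnb.

```r
# Toy data frame standing in for airbnb (hypothetical values)
toy <- data.frame(
  instant_bookable = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  business         = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Cross-tabulate the predictor (rows) against the target (columns)
tab <- table(instant_bookable = toy$instant_bookable, business = toy$business)

# margin = 1 normalises each row, i.e. it gives the share of business
# listings within each level of the predictor
row_shares <- prop.table(tab, margin = 1)
print(row_shares)
```

Reading the table row by row gives the same information as comparing the heights of the coloured segments in the filled bar plot.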
neighbourhood_group_cleansed
The variable neighbourhood_group_cleansed reports in which of the 10 districts of Barcelona a listing is located.
ggplot(data = airbnb) +
geom_bar(aes(x = neighbourhood_group_cleansed, fill = business)) +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = airbnb) +
geom_bar(aes(x = neighbourhood_group_cleansed, fill = business), position = "fill") +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar plots show a clear difference in the share of professional hosts across districts. For instance, Eixample and Ciutat Vella, which have the largest supply because they are located in the historical center, have a higher share of professional hosts than Nou Barris and Sant Andreu, located on the outskirts of the city.
neighbourhood_cleansed
ggplot(data = airbnb) +
geom_bar(aes(x = neighbourhood_cleansed, fill = business)) +
scale_fill_manual(values = c("#df546b", "#2297e6")) + coord_flip()
ggplot(data = airbnb) +
geom_bar(aes(x = neighbourhood_cleansed, fill = business), position = "fill") +
scale_fill_manual(values = c("#df546b", "#2297e6")) + coord_flip()
This visualization gives a more detailed view than the one for neighbourhood_group_cleansed. For instance, the neighbourhoods with the most listings, such as la Dreta de l’Eixample, el Raval and el Barri Gòtic, all have a high presence of professional hosts.
room_type
ggplot(data = airbnb) +
geom_bar(aes(x = room_type, fill = business)) +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = airbnb) +
geom_bar(aes(x = room_type, fill = business), position = "fill") +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Although the counts for “Hotel room” and “Shared room” are extremely low, the variable room_type appears to be important for predicting the target variable business, as shown by the difference in the business rate between “Private room” and “Entire home/apt”.
host_response_time
ggplot(data = airbnb) +
geom_bar(aes(x = host_response_time, fill = business)) +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = airbnb) +
geom_bar(aes(x = host_response_time, fill = business), position = "fill") +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Since the bar plots are not able to show graphical evidence for the
importance of the variable in the prediction of
business,
we run a chi-squared test to determine if their relation is
statistically significant.
\[ \bigg\{ \begin{matrix} H_0: \pi_{fewdays, \ T} = \pi_{withinday, \ T} = \pi_{withinhours, \ T} = \pi_{withinhour, \ T}\\ H_a: At \ least \ one \ of \ the \ claims \ in \ H_0 \ is \ wrong. \end{matrix} \]
Pearson's Chi-squared test
data: table(airbnb$business, airbnb$host_response_time)
X-squared = 68.395, df = 3, p-value = 9.415e-15
As the p-value is 9.415e-15, we can reject the null hypothesis and consider the variable important for the prediction of business.
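As a minimal sketch of how such a test of independence is run, the following example applies chisq.test() to a contingency table of the target against a four-level factor. The data are simulated with made-up probabilities and are not the Airbnb data; only the structure of the call mirrors the output above.

```r
set.seed(1)  # arbitrary seed, only to make the toy data reproducible

# Hypothetical data: response time is drawn differently for each group
n <- 400
business <- rep(c(FALSE, TRUE), each = n / 2)
levels_rt <- c("a few days or more", "within a day",
               "within a few hours", "within an hour")
response_time <- c(
  sample(levels_rt, n / 2, replace = TRUE, prob = c(0.10, 0.30, 0.30, 0.30)),
  sample(levels_rt, n / 2, replace = TRUE, prob = c(0.05, 0.10, 0.15, 0.70))
)

# Contingency table: target in rows, predictor levels in columns
tab <- table(business, factor(response_time, levels = levels_rt))

# Pearson's chi-squared test of independence
res <- chisq.test(tab)
print(res)

# Degrees of freedom: (rows - 1) * (columns - 1)
res$parameter
```

With two classes and four response-time levels the test has (2 - 1) * (4 - 1) = 3 degrees of freedom, which matches the df reported in the output above.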
year_first_review
ggplot(data = airbnb) +
geom_bar(aes(x = year_first_review, fill = business)) +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = airbnb) +
geom_bar(aes(x = year_first_review, fill = business), position = "fill") +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
These bar plots seem to indicate that the later a listing received its first review (a date that is presumably close to when it first started to accommodate guests), the more likely it is to be run by a business. To check this, we run a chi-squared test.
\[ \bigg\{ \begin{matrix} H_0: \pi_{2010, \ T} = \pi_{2011, \ T} = \pi_{2012, \ T} = \pi_{2013, \ T}= \pi_{2014, \ T}= \pi_{2015, \ T}= \pi_{2016, \ T}= \pi_{2017, \ T}= \pi_{2018, \ T}= \pi_{2019, \ T}= \pi_{2020, \ T}= \pi_{2021, \ T}= \pi_{2022, \ T}\\ H_a: At \ least \ one \ of \ the \ claims \ in \ H_0 \ is \ wrong. \end{matrix} \]
Pearson's Chi-squared test
data: table(airbnb$business, airbnb$year_first_review)
X-squared = 293.23, df = 12, p-value < 2.2e-16
The test allows us to reject the null hypothesis, as the p-value is < 2.2e-16, and confirms the relation between the target variable and year_first_review.
year_host_since
ggplot(data = airbnb) +
geom_bar(aes(x = year_host_since, fill = business)) +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = airbnb) +
geom_bar(aes(x = year_host_since, fill = business), position = "fill") +
scale_fill_manual(values = c("#df546b", "#2297e6")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar plot indicates that the variable
year_host_since
has higher rates of professional hosts between 2009 and 2012, and in
2020, thus revealing a relationship between the two variables.
In the following, we compare the numerical variables in the Airbnb dataset between business and non-business listings. First, a correlation matrix is used to visualize the correlations between the variables. Since no perfect correlations were found, none of the variables has to be removed. There is a high positive correlation between “accommodates” and “bedrooms”, which makes sense, as a listing with more bedrooms accommodates more guests. Furthermore, there is a negative correlation between price and minimum nights, which indicates that listings with longer minimum stays generally charge lower prices per night. Conversely, listings with higher nightly prices tend to allow shorter stays.
variable_list = c("host_response_rate", "host_acceptance_rate", "latitude", "longitude", "accommodates", "bedrooms", "price", "minimum_nights", "availability_365", "number_of_reviews", "review_scores_rating", "reviews_per_month", "verification_count")
cor_matrix = cor(airbnb[, variable_list])
ggcorrplot(cor_matrix, type = "lower", lab = TRUE, lab_size = 3)
price
We further investigate the difference in the variable “price” between business and non-business listings. The boxplot shows that accommodations of professional hosts are priced higher on average than listings of non-professional hosts. The prices of non-professional hosts appear to vary more but have a lower median and a lower box.
The density plot also reflects this. The business curve is spread wider, while the non-business curve is more concentrated and has a higher peak. In general, the business curve lies further to the right, indicating a higher average price.
This shows professional hosts to have higher average prices on their listings, which could be due to them being more profit-oriented or having access to more valuable properties.
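The quantities a boxplot displays can also be compared numerically per group. The following is a minimal sketch using tapply(); the prices in the toy data frame are hypothetical and chosen only to show how medians, interquartile ranges and means can tell different stories when outliers are present.

```r
# Hypothetical nightly prices (not taken from the dataset)
toy <- data.frame(
  price    = c(40, 55, 60, 70, 300, 90, 120, 150, 180, 400),
  business = rep(c(FALSE, TRUE), each = 5)
)

# Median and interquartile range per group: the centre line and box height
medians <- tapply(toy$price, toy$business, median)
iqrs    <- tapply(toy$price, toy$business, IQR)

# Means are pulled upwards by expensive outliers, medians are not
means <- tapply(toy$price, toy$business, mean)

print(rbind(median = medians, IQR = iqrs, mean = means))
```

Reporting the median next to the mean helps to judge whether a difference between the groups is driven by a few very expensive listings.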
availability_365
Investigating the variable availability_365, it can be seen that listings of professional hosts are available for a larger part of the year. While the non-professional mean lies around 110 days, the mean for professional listings is closer to 225, almost double. This is also reflected in the quartiles of the boxes and in the distributions.
The same is reflected in the density plot, where business and non-business accommodations peak at opposite ends of the spectrum. Both also have a small peak on the other side, but the non-business curve has its main peak at around 5, while the business curve peaks at around 330.
This reflects a large difference in availability between the listings of professional and non-professional hosts. It could be due to non-professional hosts partly living in their accommodations or needing more time between stays to clean the space.
accommodates
Investigating the variable accommodates reveals that
business listings tend to accommodate a higher number of guests on
average than non-business listings. The median number of guests for
business listings is higher and the interquartile range indicates
business listings have a wider spread in their guest capacity.
Non-business listings on the other hand show lower capacity with a
tighter range around their median.
The density plot further illustrates this difference. The curve for business listings is shifted to the right, indicating that these properties are more likely to accommodate larger groups. The peak for business listings is lower but more spread out, again indicating greater variability in how many people they can host. In contrast, non-business listings have a higher peak but are more concentrated around lower guest capacities.
These findings indicate that professional hosts generally offer a larger range of accommodations with a higher variability of how many guests can be accommodated. On average they are also seen to offer accommodations with more capacity, indicating bigger space. This could also be the reason for the difference in price that was detected earlier in the analysis.
review_scores_rating
The variable “review_scores_rating” shows that non-business Airbnb listings have higher ratings than business listings on average. The boxplot illustrates this: the median is higher and the interquartile range is narrower, indicating that non-business listings generally receive consistently positive reviews with fewer extreme outliers. Business listings, on the other hand, show more variability in review scores: the median is lower and the IQR wider.
The density plot confirms this pattern. The curve for non-business listings has a higher peak and is more concentrated around higher ratings, implying individual hosts are more likely to achieve high guest satisfaction.
These findings suggest that non-business accommodations may provide more personalized, unique experiences that lead to higher guest satisfaction. In contrast, business listings, which might offer more standardized services, show a wider spread of ratings, possibly due to varying quality across a large number of managed properties.
\[ \bigg\{ \begin{matrix} H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 \neq \mu_2 \end{matrix} \]
Welch Two Sample t-test
data: review_scores_rating by business
t = 16.373, df = 14818, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
0.1182471 0.1504091
sample estimates:
mean in group FALSE mean in group TRUE
4.615334 4.481006
To verify the statistical significance of the variable, we run a T-test that compares its mean between business and non-business listings. As the p-value is lower than 0.05, we reject H0 and conclude that “review_scores_rating” differs significantly between the two groups.
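The T-tests in this section follow the same pattern, which can be sketched as below. The ratings are simulated with made-up means and standard deviations, so the numbers are not those of the dataset; only the call to t.test() with a formula mirrors the outputs shown in the report.

```r
set.seed(2)  # arbitrary seed for reproducible toy data

# Hypothetical ratings: two groups with slightly different means
toy <- data.frame(
  review_scores_rating = c(rnorm(200, mean = 4.6, sd = 0.3),
                           rnorm(200, mean = 4.5, sd = 0.5)),
  business = rep(c(FALSE, TRUE), each = 200)
)

# t.test() uses Welch's version by default (var.equal = FALSE),
# which does not assume equal variances in the two groups
res <- t.test(review_scores_rating ~ business, data = toy)
print(res)

# Reject H0 at the 5% level only if the p-value is below 0.05
reject_h0 <- res$p.value < 0.05
reject_h0
```

The Welch variant is a reasonable default here because the boxplots suggest that the spread of the two groups differs.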
reviews_per_month
The analysis of the “reviews_per_month” variable reveals no clear visible trend. The differences are subtle, even though the boxplot indicates that non-business listings have slightly higher average reviews per month. Both categories display small and dense boxes, suggesting that review counts for both listings are clustered closely around their respective medians.
The density plot emphasizes this observation. The distribution for non-business listings peaks towards the far left, as does that of the business listings, even though it has a slightly wider spread.
\[ \bigg\{ \begin{matrix} H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 \neq \mu_2 \end{matrix} \]
Welch Two Sample t-test
data: reviews_per_month by business
t = 9.5215, df = 16767, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
0.2027027 0.3077940
sample estimates:
mean in group FALSE mean in group TRUE
1.522119 1.266871
The T-test shows the difference to be significant, with a higher mean for non-business listings. The slightly higher review frequency of non-business listings could stem from reasons similar to those behind the better review ratings, such as authentic and personal guest experiences.
verification_count
Although the boxplot shows that the averages for the two groups are the same, the density chart shows that listings whose host has a verification count of 3 have a higher chance of being managed by a professional host, while the opposite holds for those with a verification count of 1 or 2.
beds
The analysis of the “beds” variable highlights a noticeable difference between business and non-business Airbnb listings. The boxplot reveals that business listings have a higher average number of beds compared to their non-business counterparts. Furthermore, the box is wider, representing a greater variability in listings.
The density plot further supports these findings. The curve for business listings extends further to the right, indicating a higher likelihood of listings with multiple beds. Conversely, the density for non-business listings peaks at lower bed counts, reflecting that these listings are generally designed to accommodate fewer guests.
Overall, these findings align with the “accommodates” variable. Business listings are more likely to have more beds and show a higher variability in their offers.
host_response_rate
The boxplot shows that listings with professional hosts have a lower mean response rate, which also appears in the density graph. We run a T-test to verify the significance of this relation.
\[ \bigg\{ \begin{matrix} H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 \neq \mu_2 \end{matrix} \]
Welch Two Sample t-test
data: host_response_rate by business
t = 4.0939, df = 16546, p-value = 4.262e-05
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
0.494944 1.404259
sample estimates:
mean in group FALSE mean in group TRUE
94.16176 93.21215
Since the T-test has a p-value of 4.262e-05, the null hypothesis can be rejected and the variable can be considered important.
host_acceptance_rate
Although the boxplots of the two groups have different IQRs but highly similar means, the density graph suggests that the distributions of the two groups differ.
\[ \bigg\{ \begin{matrix} H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 \neq \mu_2 \end{matrix} \]
Welch Two Sample t-test
data: host_acceptance_rate by business
t = -7.6287, df = 16664, p-value = 2.499e-14
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-3.231793 -1.910529
sample estimates:
mean in group FALSE mean in group TRUE
87.11446 89.68562
To verify the significance of the variable
host_acceptance_rate in predicting business,
we use a T-test which allows us to reject the null hypothesis and
consider the variable as important, since its p-value is lower than α =
0.05.
number_of_reviews
The difference between the two groups seems to be slight considering
both the boxplot and the density graph. Consequently, we use a T-test to
verify if there is a significant difference between the two. \[
\bigg\{
\begin{matrix}
H_0: \mu_1 = \mu_2 \\
H_a: \mu_1 \neq \mu_2
\end{matrix}
\]
Welch Two Sample t-test
data: number_of_reviews by business
t = 15.119, df = 16302, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
15.07288 19.56323
sample estimates:
mean in group FALSE mean in group TRUE
48.91576 31.59770
Since the t-test shows a p-value < 2.2e-16, we can reject the null
hypothesis and use the variable number_of_reviews to
predict the target variable.
latitude and longitude
ggplot(airbnb, aes(x = longitude, y = latitude, color = business)) +
geom_point(alpha = 0.7) +
theme_minimal() +
theme(legend.position = "right")
Plotting the two variables together allows us to visualize the
distribution of listings with professional hosts through different
coordinates. In fact, we can identify clusters in the top and bottom
left of listings with mostly professional hosts. To verify whether these
variables are important for the prediction of the target variable we run
a T-test for each.
Welch Two Sample t-test
data: longitude by business
t = 2.3753, df = 16761, p-value = 0.01755
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
0.0001437123 0.0015005787
sample estimates:
mean in group FALSE mean in group TRUE
2.165524 2.164702
Welch Two Sample t-test
data: latitude by business
t = 5.4183, df = 16705, p-value = 6.1e-08
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
0.0008497502 0.0018130277
sample estimates:
mean in group FALSE mean in group TRUE
41.39212 41.39078
The T-tests for longitude and latitude both have a p-value below α = 0.05. Thus, they can be used to build a model to predict business.
minimum_nights
ggplot(data = airbnb) +
geom_density(aes(x = minimum_nights, fill = business), alpha = 0.3) + coord_cartesian(xlim = c(0, 50))
\[
\bigg\{
\begin{matrix}
H_0: \mu_1 = \mu_2 \\
H_a: \mu_1 \neq \mu_2
\end{matrix}
\]
Welch Two Sample t-test
data: minimum_nights by business
t = 1.5418, df = 16047, p-value = 0.1231
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-0.09816614 0.82180798
sample estimates:
mean in group FALSE mean in group TRUE
11.92387 11.56205
These plots indicate that the two groups differ little in the variable minimum_nights, and the T-test, with a p-value of 0.1231, confirms this: the null hypothesis cannot be rejected at α = 0.05.
maximum_nights
ggplot(data = airbnb) +
geom_density(aes(x = maximum_nights, fill = business), alpha = 0.3) + coord_cartesian(xlim = c(0, 50))
The variable
maximum_nights has a higher mean for listings
with professional hosts, compared to the others. \[
\bigg\{
\begin{matrix}
H_0: \mu_1 = \mu_2 \\
H_a: \mu_1 \neq \mu_2
\end{matrix}
\]
Welch Two Sample t-test
data: maximum_nights by business
t = -12.43, df = 16471, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-104.39678 -75.95606
sample estimates:
mean in group FALSE mean in group TRUE
610.8360 701.0124
Running a T-test to verify the significance of
maximum_nights for the prediction of business
results in a p-value < 2.2e-16, confirming the variable’s
significance.
In order to apply the kNN algorithm, we create dummy variables for
n-1 levels of room_type,
neighbourhood_group_cleansed and
neighbourhood_cleansed
airbnb$dummy_entire_home <- ifelse(airbnb$room_type == "Entire home/apt", 1, 0)
airbnb$dummy_hotel_room <- ifelse(airbnb$room_type == "Hotel room", 1, 0)
airbnb$dummy_private_room <- ifelse(airbnb$room_type == "Private room", 1, 0)
airbnb$neighgroup1 <- ifelse(airbnb$neighbourhood_group_cleansed == "Ciutat Vella", 1, 0)
airbnb$neighgroup2 <- ifelse(airbnb$neighbourhood_group_cleansed == "Eixample", 1, 0)
airbnb$neighgroup3 <- ifelse(airbnb$neighbourhood_group_cleansed == "Gràcia", 1, 0)
airbnb$neighgroup4 <- ifelse(airbnb$neighbourhood_group_cleansed == "Horta-Guinardó", 1, 0)
airbnb$neighgroup5 <- ifelse(airbnb$neighbourhood_group_cleansed == "Les Corts", 1, 0)
airbnb$neighgroup6 <- ifelse(airbnb$neighbourhood_group_cleansed == "Nou Barris", 1, 0)
airbnb$neighgroup7 <- ifelse(airbnb$neighbourhood_group_cleansed == "Sant Andreu", 1, 0)
airbnb$neighgroup8 <- ifelse(airbnb$neighbourhood_group_cleansed == "Sant Martí", 1, 0)
airbnb$neighgroup9 <- ifelse(airbnb$neighbourhood_group_cleansed == "Sarrià-Sant Gervasi", 1, 0)
# Create a vector of neighborhood names
#neighborhoods <- c( "Baró de Viver", "Can Baró", "Can Peguera", "Canyelles", "Ciutat Meridiana", "Diagonal Mar i el Front Marítim del Poblenou", "el Baix Guinardó", "el Barri Gòtic", "el Besòs i el Maresme", "el Bon Pastor", "el Camp d'en Grassot i Gràcia Nova", "el Camp de l'Arpa del Clot", "el Carmel", "el Clot", "el Coll", "el Congrés i els Indians", "el Fort Pienc", "el Guinardó", "el Parc i la Llacuna del Poblenou", "el Poble Sec", "el Poblenou", "el Putxet i el Farró", "el Raval", "el Turó de la Peira", "Horta", "Hostafrancs", "l'Antiga Esquerra de l'Eixample", "la Barceloneta", "la Bordeta", "la Clota", "la Dreta de l'Eixample", "la Font d'en Fargues", "la Font de la Guatlla", "la Guineueta", "la Marina de Port", "la Marina del Prat Vermell", "la Maternitat i Sant Ramon", "la Nova Esquerra de l'Eixample", "la Prosperitat", "la Sagrada Família", "la Sagrera", "la Salut", "la Teixonera", "la Trinitat Nova", "la Trinitat Vella", "la Vall d'Hebron", "la Verneda i la Pau", "la Vila de Gràcia", "la Vila Olímpica del Poblenou", "les Corts", "les Roquetes", "les Tres Torres", "Montbau", "Navas", "Pedralbes", "Porta", "Provençals del Poblenou", "Sant Andreu", "Sant Antoni", "Sant Genís dels Agudells", "Sant Gervasi - Galvany", "Sant Gervasi - la Bonanova", "Sant Martí de Provençals", "Sant Pere, Santa Caterina i la Ribera", "Sants", "Sants - Badal", "Sarrià", "Torre Baró", "Vallbona", "Vallcarca i els Penitents", "Vallvidrera, el Tibidabo i les Planes", "Verdun", "Vilapicina i la Torre Llobeta" )
# Create dummy variables for each neighborhood without spaces in the variable names
#for (i in seq_along(neighborhoods)) {
# neigh <- neighborhoods[i]
# Use the index to create a variable name
# airbnb[[paste0("neigh_", i)]] <- ifelse(airbnb$neighbourhood_cleansed == neigh, 1, 0)
#}
Additionally, we randomly partition the dataset into a training set (80%) and a testing set (20%). In doing so, we use a seed to ensure the replicability of the results.
set.seed(8)
data_sets = partition(data = airbnb, prob = c(0.8, 0.2))
train_set = data_sets$part1
test_set = data_sets$part2
actual_test = test_set$business
\[ \bigg\{ \begin{matrix} H_0: \pi_{business(yes),\ train} = \pi_{business(yes),\ test} \\ H_a: \pi_{business(yes),\ train} \neq \pi_{business(yes),\ test} \end{matrix} \]
x1 = sum(train_set$business == TRUE)
x2 = sum(test_set$business == TRUE)
n1 = nrow(train_set)
n2 = nrow(test_set)
prop.test(x = c(x1, x2), n = c(n1, n2))
2-sample test for equality of proportions with continuity correction
data: c(x1, x2) out of c(n1, n2)
X-squared = 0.001958, df = 1, p-value = 0.9647
alternative hypothesis: two.sided
95 percent confidence interval:
-0.01968212 0.01845348
sample estimates:
prop 1 prop 2
0.4414147 0.4420290
\(H_0\) is not rejected, as the p-value of 0.9647 is higher than α = 0.05. Therefore, the proportions of listings whose host is a business do not differ significantly between the training and testing datasets, so we can proceed with data modelling.
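The partitioning and the validation of the partition can be sketched in base R as follows. This is an analogue written with sample(), not the implementation of the partition() function used above, and the toy data frame with its 44% share of business listings is a made-up stand-in for airbnb.

```r
set.seed(8)  # as in the report, a fixed seed makes the split replicable

# Hypothetical data set with a logical target (not the real airbnb data)
toy <- data.frame(id = 1:1000,
                  business = sample(c(TRUE, FALSE), 1000, replace = TRUE,
                                    prob = c(0.44, 0.56)))

# Draw 80% of the row indices at random for training, the rest for testing
train_idx <- sample(nrow(toy), size = round(0.8 * nrow(toy)))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Two-sample test for equality of the proportion of business listings
x <- c(sum(train_toy$business), sum(test_toy$business))
n <- c(nrow(train_toy), nrow(test_toy))
res <- prop.test(x = x, n = n)
print(res)
```

A large p-value in this test indicates that the random split did not distort the class balance between the two sets.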
Based on the results of the Exploratory Data Analysis, 20 out of 23 predictors in the cleaned dataset have been identified as influencing the target variable business: room_type, price, instant_bookable, reviews_per_month, host_response_time, number_of_reviews, neighbourhood_cleansed, host_is_superhost, year_first_review, year_host_since, neighbourhood_group_cleansed, accommodates, availability_365, review_scores_rating, host_response_rate, host_acceptance_rate, beds, maximum_nights, verification_count, and latitude. Using the partitioned dataset, we will apply different machine learning algorithms with these selected predictors to assess their effectiveness in predicting host professionalization on Airbnb.
Using logistic regression, we aim to classify whether an Airbnb listing is managed by a professional host or not. To do so we will use the predictors identified in the EDA.
formula = business ~ room_type + price + instant_bookable + reviews_per_month + host_response_time + number_of_reviews + host_is_superhost + year_first_review + year_host_since + neighbourhood_group_cleansed + accommodates + availability_365 + review_scores_rating + host_response_rate + host_acceptance_rate + beds + maximum_nights + verification_count + latitude
regress = glm(formula, data = train_set, family = binomial)
We use summary() to view a summary of the regression results:
Call:
glm(formula = formula, family = binomial, data = train_set)
Coefficients:
Estimate Std. Error z value
(Intercept) 1.160e+02 1.102e+02 1.052
room_typeHotel room 4.281e-02 2.008e-01 0.213
room_typePrivate room -1.748e+00 5.601e-02 -31.214
room_typeShared room -3.186e-01 1.888e-01 -1.688
price 1.126e-03 1.510e-04 7.458
instant_bookableTRUE 3.912e-01 4.892e-02 7.997
reviews_per_month 1.217e-02 1.472e-02 0.827
host_response_timewithin a day 3.737e-01 1.855e-01 2.014
host_response_timewithin a few hours 4.478e-01 1.958e-01 2.287
host_response_timewithin an hour 6.460e-01 1.968e-01 3.283
number_of_reviews -5.809e-03 4.303e-04 -13.499
host_is_superhostTRUE -5.594e-01 6.075e-02 -9.208
year_first_review 5.739e-02 9.728e-03 5.900
year_host_since -6.339e-02 7.499e-03 -8.453
neighbourhood_group_cleansedEixample 5.030e-01 6.350e-02 7.921
neighbourhood_group_cleansedGràcia 3.003e-01 1.008e-01 2.979
neighbourhood_group_cleansedHorta-Guinardó -5.011e-01 1.758e-01 -2.850
neighbourhood_group_cleansedLes Corts -1.093e-01 1.106e-01 -0.988
neighbourhood_group_cleansedNou Barris -8.981e-01 3.041e-01 -2.953
neighbourhood_group_cleansedSant Andreu -7.659e-01 2.149e-01 -3.563
neighbourhood_group_cleansedSant Martí -4.331e-01 1.029e-01 -4.207
neighbourhood_group_cleansedSants-Montjuïc -9.826e-02 8.030e-02 -1.224
neighbourhood_group_cleansedSarrià-Sant Gervasi 2.640e-01 1.222e-01 2.161
accommodates -2.245e-02 1.657e-02 -1.355
availability_365 2.390e-03 1.667e-04 14.332
review_scores_rating -3.125e-01 4.107e-02 -7.607
host_response_rate -8.769e-03 2.098e-03 -4.180
host_acceptance_rate 3.420e-03 1.185e-03 2.886
beds 7.348e-02 3.110e-02 2.363
maximum_nights 2.587e-04 4.528e-05 5.713
verification_count 7.406e-01 5.136e-02 14.420
latitude -2.529e+00 2.610e+00 -0.969
Pr(>|z|)
(Intercept) 0.292858
room_typeHotel room 0.831154
room_typePrivate room < 2e-16 ***
room_typeShared room 0.091436 .
price 8.75e-14 ***
instant_bookableTRUE 1.27e-15 ***
reviews_per_month 0.408422
host_response_timewithin a day 0.043992 *
host_response_timewithin a few hours 0.022173 *
host_response_timewithin an hour 0.001029 **
number_of_reviews < 2e-16 ***
host_is_superhostTRUE < 2e-16 ***
year_first_review 3.64e-09 ***
year_host_since < 2e-16 ***
neighbourhood_group_cleansedEixample 2.35e-15 ***
neighbourhood_group_cleansedGràcia 0.002894 **
neighbourhood_group_cleansedHorta-Guinardó 0.004368 **
neighbourhood_group_cleansedLes Corts 0.323280
neighbourhood_group_cleansedNou Barris 0.003146 **
neighbourhood_group_cleansedSant Andreu 0.000366 ***
neighbourhood_group_cleansedSant Martí 2.58e-05 ***
neighbourhood_group_cleansedSants-Montjuïc 0.221078
neighbourhood_group_cleansedSarrià-Sant Gervasi 0.030707 *
accommodates 0.175572
availability_365 < 2e-16 ***
review_scores_rating 2.80e-14 ***
host_response_rate 2.92e-05 ***
host_acceptance_rate 0.003898 **
beds 0.018143 *
maximum_nights 1.11e-08 ***
verification_count < 2e-16 ***
latitude 0.332630
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 18473 on 13458 degrees of freedom
Residual deviance: 14018 on 13427 degrees of freedom
AIC: 14082
Number of Fisher Scoring iterations: 4
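The coefficients in the summary are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The following is a minimal sketch on simulated data; the single predictor, its true effect and the 0.5 classification cut-off are assumptions made for illustration and are not taken from the report.

```r
set.seed(3)  # arbitrary seed for reproducible toy data

# Hypothetical data: probability of a business listing rises with availability
n <- 500
availability_365 <- runif(n, 0, 365)
p <- plogis(-2 + 0.01 * availability_365)
toy <- data.frame(availability_365,
                  business = rbinom(n, size = 1, prob = p) == 1)

fit <- glm(business ~ availability_365, data = toy, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# Predicted probabilities, turned into classes with a 0.5 cut-off
# (the cut-off is an assumption, not a choice stated in the report)
prob <- predict(fit, newdata = toy, type = "response")
pred <- prob > 0.5
accuracy <- mean(pred == toy$business)
accuracy
```

Applied to the summary above, the estimate 3.912e-01 for instant_bookableTRUE corresponds to an odds ratio of exp(0.3912) ≈ 1.48, i.e. the odds of a listing being managed by a professional host are about 48% higher for instantly bookable listings, holding the other predictors fixed.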
Normality: Observing the Q-Q plot, it shows a mostly diagonal line which suggests approximate normality. Slight deviations are seen at higher quantiles.
Linearity: The Residuals vs. Fitted plot shows a curved pattern rather than points randomly spread around zero, which suggests that the linear relationship might not hold well. This indicates potential non-linearity, meaning some predictors may not have a linear relationship with the response variable.
Independence: When looking at the residuals vs fitted plot and the residuals vs leverage, we do see trends and clustering in some of the plots. This could indicate that some issues with independence are present.
Variance: The residuals vs leverage plot shows a trend towards 0 which could indicate heteroscedasticity. This is also confirmed by the scale-location plot that shows patterns and not a random spread of points or a horizontal line.
We apply the naive Bayes classifier using the predictors and the target variable business through the following formula, which does not include neighbourhood_cleansed as the algorithm requires predictors to be independent:
formula = business ~ room_type + price + instant_bookable + host_response_time + number_of_reviews + host_is_superhost + year_first_review + year_host_since + neighbourhood_group_cleansed + accommodates + availability_365 + review_scores_rating + host_response_rate + host_acceptance_rate + beds + maximum_nights + verification_count + latitude + longitude + reviews_per_month
We use the naive_bayes() command from the R package naivebayes to apply the algorithm to the training set:
================================= Naive Bayes ==================================
Call:
naive_bayes.formula(formula = formula, data = train_set)
--------------------------------------------------------------------------------
Laplace smoothing: 0
--------------------------------------------------------------------------------
A priori probabilities:
FALSE TRUE
0.5585853 0.4414147
--------------------------------------------------------------------------------
Tables:
--------------------------------------------------------------------------------
:: room_type (Categorical)
--------------------------------------------------------------------------------
room_type FALSE TRUE
Entire home/apt 0.424980048 0.809123043
Hotel room 0.005719606 0.015822252
Private room 0.559856345 0.162767211
Shared room 0.009444001 0.012287494
--------------------------------------------------------------------------------
:: price (Gaussian)
--------------------------------------------------------------------------------
price FALSE TRUE
mean 112.3449 192.7766
sd 138.8780 197.7812
--------------------------------------------------------------------------------
:: instant_bookable (Bernoulli)
--------------------------------------------------------------------------------
instant_bookable FALSE TRUE
FALSE 0.5480181 0.3898334
TRUE 0.4519819 0.6101666
--------------------------------------------------------------------------------
:: host_response_time (Categorical)
--------------------------------------------------------------------------------
host_response_time FALSE TRUE
a few days or more 0.02553871 0.02861471
within a day 0.09749933 0.07742804
within a few hours 0.18582070 0.15132133
within an hour 0.69114126 0.74263592
--------------------------------------------------------------------------------
:: number_of_reviews (Gaussian)
--------------------------------------------------------------------------------
number_of_reviews FALSE TRUE
mean 48.98111 31.61623
sd 89.26073 58.84297
--------------------------------------------------------------------------------
# ... and 15 more tables
--------------------------------------------------------------------------------
Here we can observe probability tables for each variable.
To get a summary of the model, we use the summary()
function:
================================= Naive Bayes ==================================
- Call: naive_bayes.formula(formula = formula, data = train_set)
- Laplace: 0
- Classes: 2
- Samples: 13459
- Features: 20
- Conditional distributions:
- Bernoulli: 2
- Categorical: 3
- Gaussian: 15
- Prior probabilities:
- FALSE: 0.5586
- TRUE: 0.4414
--------------------------------------------------------------------------------
The summary reports that the model was trained on 13459 samples, each
with 20 features. The two binary variables, host_is_superhost and
instant_bookable, are modelled with a Bernoulli distribution, the 15
numerical variables with a Gaussian distribution, and the three
categorical variables, such as neighbourhood_cleansed, with a
Categorical distribution.
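To illustrate how the tables above feed into a prediction, the following sketch combines the prior with just two of the 20 tables (room_type and price) for one hypothetical listing. The listing itself (an entire home/apt priced at 150) is an assumption made for illustration, and because the other 18 features are ignored, the result will not match the output of the fitted model.

```r
# Values copied from the model output above (priors and two of the 20 tables).
prior         <- c("FALSE" = 0.5585853,   "TRUE" = 0.4414147)
p_entire_home <- c("FALSE" = 0.424980048, "TRUE" = 0.809123043)
price_mean    <- c("FALSE" = 112.3449,    "TRUE" = 192.7766)
price_sd      <- c("FALSE" = 138.8780,    "TRUE" = 197.7812)

# Hypothetical listing (assumption): an entire home/apt priced at 150.
listing_price <- 150

# Gaussian likelihood of the price under each class.
lik_price <- dnorm(listing_price, mean = price_mean, sd = price_sd)

# Naive Bayes: the posterior is proportional to the prior times the
# product of the per-feature likelihoods (features treated as independent).
unnormalised <- prior * p_entire_home * lik_price
posterior    <- unnormalised / sum(unnormalised)

print(round(posterior, 3))
```

With only these two features the posterior leans slightly towards the professional class (TRUE), mostly because entire homes are about twice as likely among professional hosts as among non-professional ones.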
In order to apply the k-Nearest Neighbor algorithm, we created
dummy variables for the categorical variables
neighbourhood_group_cleansed, neighbourhood_cleansed,
and room_type, and added them to the following formula:
formula_knn = business ~ price + availability_365 + accommodates + review_scores_rating + instant_bookable + host_response_time + number_of_reviews + host_is_superhost + year_first_review + year_host_since + host_response_rate + host_acceptance_rate + beds + maximum_nights + verification_count + dummy_entire_home + dummy_hotel_room + dummy_private_room + neighgroup1 + latitude + longitude + reviews_per_month + neighgroup2 + neighgroup3 + neighgroup4 + neighgroup5 + neighgroup6 + neighgroup7 + neighgroup8 + neighgroup9
To find the optimal value of k based on the Error Rate, we run kNN with the training set and plot the resulting Error Rates for values of k from 1 to 30.
kNN.plot(formula_knn, train = train_set, test = test_set, transform = "minmax", k.max = 30, set.seed = 7)
From the plot, we can observe that the optimal value of k is 1 as it has
the lowest Error Rate.
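Because kNN relies on distances, the transform = "minmax" argument matters: without rescaling, variables with large ranges such as price dominate the distance. The sketch below uses base R only and three hypothetical listings (the values are assumptions, not taken from the dataset) to show how min-max scaling can change which listing counts as the nearest neighbour. It mimics the idea of the transformation and is not the liver implementation itself.

```r
# Min-max scaling maps each numeric column onto the interval [0, 1].
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Three hypothetical listings (toy values, not from the dataset).
toy <- data.frame(price    = c(100, 130, 160),
                  latitude = c(41.38, 41.45, 41.385))

# Euclidean distances on the raw values: price differences dominate.
d_raw <- as.matrix(dist(toy))

# Euclidean distances after scaling: both variables carry comparable weight.
toy_scaled <- as.data.frame(lapply(toy, minmax))
d_scaled   <- as.matrix(dist(toy_scaled))

# Nearest neighbour of listing 1 before and after scaling.
nearest_raw    <- names(which.min(d_raw[1, -1]))
nearest_scaled <- names(which.min(d_scaled[1, -1]))
print(c(raw = nearest_raw, scaled = nearest_scaled))
```

On the raw values listing 2 is closest to listing 1 because its price is closer, whereas after scaling listing 3 becomes the nearest neighbour because it is almost at the same latitude.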
Through the kNN() function from the R package liver, we
calculate the probabilities for classification predictions in the
testing set.
Before deploying, ensure the models align with the project’s goals.
Here we report confusion matrices for each model through the
conf.mat.plot() function:
prob_regression_airbnb = predict(regress, test_set, type = "response")
prob_regression_airbnb = 1 - prob_regression_airbnb
conf.mat.plot(prob_regression_airbnb, actual_test, cutoff = 0.5, reference = FALSE, main = "Regression")
prob_naive_bayes = predict(naive_bayes, test_set, type = "prob")[, 1]
conf.mat.plot(prob_naive_bayes, actual_test, cutoff = 0.5, reference = FALSE, main = "Naive Bayes")
Considering the three confusion matrices, the kNN algorithm seems to
perform best, with only 504 wrong predictions, compared with 859 for
the logistic regression and 906 for the naive Bayes algorithm.
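The counts of wrong predictions come from turning each predicted probability into a class at the cutoff of 0.5 and comparing it with the actual class. The following base R sketch shows that step on six hypothetical observations (the probabilities and labels are assumptions for illustration). As in the code above, the probabilities refer to the reference class FALSE; the exact internals of conf.mat.plot() may differ.

```r
# Hypothetical probabilities of the reference class "FALSE" and the
# matching actual classes (toy values, not from the test set).
prob_false <- c(0.92, 0.35, 0.61, 0.08, 0.77, 0.45)
actual     <- factor(c("FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE"),
                     levels = c("FALSE", "TRUE"))

cutoff <- 0.5

# Predict the reference class whenever its probability reaches the cutoff.
predicted <- factor(ifelse(prob_false >= cutoff, "FALSE", "TRUE"),
                    levels = c("FALSE", "TRUE"))

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf <- table(Actual = actual, Predicted = predicted)

# Wrong predictions are the off-diagonal cells.
n_wrong    <- sum(conf) - sum(diag(conf))
error_rate <- n_wrong / sum(conf)

print(conf)
print(c(wrong = n_wrong, error_rate = round(error_rate, 3)))
```

Since all three models are evaluated on the same test set, comparing the number of wrong predictions is equivalent to comparing their error rates.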
library(pROC)     # roc(), auc(), ggroc()
library(ggplot2)  # theme_minimal(), ggtitle(), scale_color_manual(), theme()
roc_naive_bayes = roc(actual_test, prob_naive_bayes)
roc_regression = roc(actual_test, prob_regression_airbnb)
roc_knn = roc(actual_test, prob_knn)
# The legend labels must follow the order of the curves passed to ggroc()
ggroc(list(roc_naive_bayes, roc_knn, roc_regression), size = 0.8) +
  theme_minimal() + ggtitle("ROC plots with their AUC values") +
  scale_color_manual(values = 1:3,
    labels = c(paste("Naive Bayes; AUC=", round(auc(roc_naive_bayes), 3)),
               paste("kNN; AUC=", round(auc(roc_knn), 3)),
               paste("Logistic Regression; AUC=", round(auc(roc_regression), 3)))) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(.7, .3), text = element_text(size = 17))
The ROC plot shows that the Naive Bayes classifier performs worst. The
kNN model and the logistic regression model have similar AUCs of 0.852
and 0.815, respectively. However, since the kNN model also performs
better in terms of the confusion matrices, and since we could not
verify the assumptions of the regression model, the kNN model emerges
as the best of the three for this dataset.
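For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base R sketch below computes it in two equivalent ways on six hypothetical scores (toy values, not model output). The pROC package computes the area under the curve itself, so this is an illustration of the quantity rather than of the package's internals.

```r
# Hypothetical scores for the positive class and the true labels.
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
label <- c(1, 1, 0, 1, 0, 0)

n_pos <- sum(label == 1)
n_neg <- sum(label == 0)

# Rank-sum (Mann-Whitney) form of the AUC.
r <- rank(score)
auc_rank <- (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Same quantity by brute force: the share of positive/negative pairs
# in which the positive case has the higher score (no ties in this toy data).
pairs     <- outer(score[label == 1], score[label == 0], ">")
auc_pairs <- mean(pairs)

print(c(rank_based = auc_rank, pair_based = auc_pairs))
```

In this toy example 8 of the 9 positive/negative pairs are ordered correctly, so both calculations give 8/9.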
With the current housing crisis and over-tourism being protested by locals in Barcelona, we wanted to investigate the role Airbnb plays, specifically how its transition into a professionalized business has impacted the availability of affordable housing. This led us to the research question: To what degree do the variables from Airbnb listings influence the prediction of host professionalization? To answer this question, we first investigated the rise of professionalized hosts and their impact on short-term rentals. We also investigated what actions the city of Barcelona has already taken against Airbnb. Then we discussed potential determinants that could help identify professional hosts, i.e., those who run their Airbnb listings as a business. These determinants included price, number of bedrooms, neighborhood and Superhost status, which in turn reflects ratings and response rates.
The model created in this study is valuable both for users, since it improves transparency in their use of Airbnb, and for policymakers, who can use it to make informed decisions about what further steps to take. As discussed before, the city of Barcelona has already decided to ban short-term holiday rentals by 2028. This analysis can give a clearer picture of the scope of the problem and of the role Airbnb plays, and could support targeted regulations as an alternative to a blanket ban. By understanding where and how professional hosts operate and being able to identify them, the city can craft more targeted regulations to ensure that Airbnb does not excessively disrupt local housing markets. For example, cities like Barcelona may impose restrictions on the number of properties a single host can manage, implement mandatory registration, or set rental caps in high-demand areas to preserve residential communities.
Additionally, the analysis of the neighborhood variable can help indicate which neighborhoods are most impacted. It could be used to map the concentration of Airbnb listings and to see which areas are most affected by the issue. If the data shows that professional hosting is concentrated in certain neighborhoods, authorities can prioritize protection for those areas by enforcing stricter zoning laws or limiting new Airbnb listings to prevent the displacement of local residents.
Also, by analyzing data on professional hosts, i.e., those who own multiple properties or run their rentals like a business, it becomes possible to assess how much of the Airbnb market is commercialized and how much is run by casual or occasional hosts. Knowing the scale of professional hosting helps policymakers understand the degree of this impact and identify where housing supply is most affected. By studying the operational behavior of professional hosts, cities can also estimate the economic benefits generated through tourism and short-term rentals, such as tax revenue. Barcelona can use this information to design a fair taxation system for short-term rentals, ensuring that professional hosts contribute appropriately to the city’s economy while also funding programs to mitigate housing issues. Tax revenue from professional hosts can be funneled back into housing or public services, helping to alleviate some of the negative externalities, such as the rising cost of living and gentrification.
Abrate, G., Sainaghi, R., & Mauri, A. G. (2021a). Dynamic pricing in Airbnb: Individual versus professional hosts. Journal of Business Research, 141, 191–199. https://doi.org/10.1016/j.jbusres.2021.12.012
Abrate, G., Sainaghi, R., & Mauri, A. G. (2021b). Dynamic pricing in Airbnb: Individual versus professional hosts. Journal of Business Research, 141, 191–199. https://doi.org/10.1016/j.jbusres.2021.12.012
Barron, K., Kung, E., & Proserpio, D. (2020). The Effect of Home-Sharing on House Prices and Rents: Evidence from Airbnb. Marketing Science, 40(1), 23–47. https://doi.org/10.1287/mksc.2020.1227
Chang, C., & Li, S. (2020). Study of Price Determinants of Sharing Economy-Based Accommodation Services: Evidence from Airbnb.com. Journal of Theoretical and Applied Electronic Commerce Research, 16(4), 584–601. https://doi.org/10.3390/jtaer16040035
Chen, W., Wei, Z., & Xie, K. (2022). The Battle for Homes: How does home sharing disrupt local residential markets? Management Science, 68(12), 8589–8612. https://doi.org/10.1287/mnsc.2022.4299
Teubner, T., Hawlitschek, F., & Dann, D. (2017). Price determinants on Airbnb: How reputation pays off in the sharing economy. Journal of Self-Governance and Management Economics, 5(4). https://addletonacademicpublishers.com/contents-jgme/1083-volume-5-4-2017/3067-price-determinants-on-airbnb-how-reputation-pays-off-in-the-sharing-economy
Deboosere, R., Kerrigan, D. J., Wachsmuth, D., & El-Geneidy, A. (2019). Location, location and professionalization: a multilevel hedonic analysis of Airbnb listing prices and revenue. Regional Studies Regional Science, 6(1), 143–156. https://doi.org/10.1080/21681376.2019.1592699
Garcia-López, M., Jofre-Monseny, J., Martínez-Mazza, R., & Segú, M. (2020). Do short-term rental platforms affect housing markets? Evidence from Airbnb in Barcelona. Journal of Urban Economics, 119, 103278. https://doi.org/10.1016/j.jue.2020.103278
Garz, M., & Schneider, A. (2023). Taxation of short-term rentals: Evidence from the introduction of the “Airbnb tax” in Norway. Economics Letters, 226, 111120. https://doi.org/10.1016/j.econlet.2023.111120
Hidalgo, A., Riccaboni, M., & Velázquez, F. J. (2024). The effect of short‐term rentals on local consumption amenities: Evidence from Madrid. Journal of Regional Science, 64(3), 621–648. https://doi.org/10.1111/jors.12685
Jiang, Haomin. (2023). “Airbnb Barcelona Dataset.” Accessed October 18, 2024. https://www.kaggle.com/datasets/haominjiang/airbnb-barcelona-dataset.
Miguel, C., Braje, I. N., Drotarova, M. H., Dumančić, K., Kirkulak-Uludag, B., & Giglio, C. (2024). The effects of the professionalization of hosting on service quality: Towards quality standards and certifications within the short-term rental market. International Journal of Hospitality Management, 122, 103796. https://doi.org/10.1016/j.ijhm.2024.103796
What’s required to be a Superhost - Airbnb Help Centre. (n.d.). Airbnb. https://www.airbnb.com/help/article/829